Implementation of Lexical Analysis



Outline 



Specifying lexical structure using regular 
expressions 



Finite automata 

Deterministic Finite Automata (DFAs) 

Non-deterministic Finite Automata 
(NFAs) 



Implementation of regular expressions 

RegExp => NFA => DFA => Tables 
How are regular expressions used to construct a full lexical specification
of a programming language?



Notation 



At least one: A+ ≡ AA*

Union: A | B ≡ A + B

Option: A + ε ≡ A?

Range: 'a'+'b'+...+'z' ≡ [a-z]

Excluded range: complement of [a-z] ≡ [^a-z]
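
The shorthand above can be checked on a few sample strings. The sketch below uses Python's `re` module as an assumption (the slides do not name a tool); note that union is written `|` there, where the slides write `+`. The helper `same_language` and the sample list are hypothetical names introduced only for this illustration, and agreement on a handful of samples is a sanity check, not a proof of equivalence.

```python
import re

def same_language(p1, p2, samples):
    """True if both patterns accept exactly the same strings from samples."""
    return all(
        bool(re.fullmatch(p1, s)) == bool(re.fullmatch(p2, s)) for s in samples
    )

samples = ["", "a", "aa", "b", "ab", "z", "A", "7"]

# At least one: A+ behaves like AA*
print(same_language(r"a+", r"aa*", samples))
# Option: A? behaves like A + epsilon (an empty alternative)
print(same_language(r"a?", r"(?:a|)", samples))
# Range: [a-c] behaves like 'a' + 'b' + 'c'
print(same_language(r"[a-c]", r"a|b|c", samples))
# Excluded range: each single character matches exactly one of [a-z], [^a-z]
print(all(
    bool(re.fullmatch(r"[a-z]", c)) != bool(re.fullmatch(r"[^a-z]", c))
    for c in "az09A_ "
))
```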
Regular Expressions in Lexical Specification



A specification for the predicate

s ∈ L(R)

What do we want to do?

We want to know whether a given string is in 
this language or not. 

But a yes/no answer is not enough! 

Instead: partition the input string into tokens. 
Each one of these tokens is in L(R). 



We adapt regular expressions to this goal 



Dr. Sherin ElGokhy 



Regular Expressions => Lexical Spec. (1)



Steps to construct a full lexical specification 
from regular expressions 

1. Write a rexp for the lexemes of each token

Number = digit+

Keyword = 'if' + 'else' + ...

Identifier = letter (letter + digit)*

OpenPar = '('
Regular Expressions => Lexical Spec. (2)



2. Construct R, matching all lexemes for all tokens

R = Keyword + Identifier + Number + ...

  = R1 + R2 + ...

The union of all the regular expressions forms the lexical specification
of the language.






Regular Expressions => Lexical Spec. (3)



3. Let input be x1...xn
   For 1 ≤ i ≤ n check

   x1...xi ∈ L(R)

4. If success, then we know that

   x1...xi ∈ L(Rj) for some j

5. Remove x1...xi from input and go to (3)

Repeat until the input string is empty (the entire input is analyzed)
Ambiguities (1)

There are ambiguities in the algorithm

How much input is used? What if

• x1...xi ∈ L(R) and also
• x1...xK ∈ L(R)

Rule: Pick the longest possible string in L(R)
The "maximal munch"
Ambiguities (2)

Which token is used?
What if

• x1...xi ∈ L(Rj)
and also

• x1...xi ∈ L(Rk)

Rule: use the rule listed first (j if j < k)

Treats "if" as a keyword, not an identifier
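
The two rules (maximal munch, then first-listed rule) can be sketched as a small tokenizer. This is a minimal illustration, assuming Python's `re` module and a hypothetical rule list modeled on the token classes named earlier; the `Whitespace` rule and the function names are assumptions added for the example. It tries prefixes from longest to shortest, mirroring the "for 1 ≤ i ≤ n check x1...xi ∈ L(R)" step, so it is quadratic and meant only to show the logic.

```python
import re

# Rules in priority order: earlier entries win ties (illustrative patterns).
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
    ("Whitespace", re.compile(r"[ \t\n]+")),
]

def next_token(text, pos):
    """Return (rule name, lexeme) for the longest prefix of text[pos:]
    matched by any rule; among rules matching that prefix, the first wins."""
    for end in range(len(text), pos, -1):      # longest prefix first
        for name, pattern in RULES:            # listed-first rule wins ties
            if pattern.fullmatch(text, pos, end):
                return name, text[pos:end]
    raise ValueError(f"no rule matches at position {pos}")

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):                     # repeat until input is empty
        name, lexeme = next_token(text, pos)
        pos += len(lexeme)
        if name != "Whitespace":
            tokens.append((name, lexeme))
    return tokens

print(tokenize("if iffy 42 ("))
```

Here "if" is reported as a Keyword because that rule is listed before Identifier, while "iffy" is an Identifier because the longest match wins before priority is consulted.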
Error Handling

What if no rule matches a prefix of the input?

Problem: Can't just get stuck... Compilers should
do good error handling.

Solution:

Write a rule matching all "bad" strings
ERROR = [all strings not in the lexical specification]
Put it last (lowest priority)
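
One common way to approximate the ERROR rule is a catch-all pattern that matches any single character and is listed last. The sketch below assumes that simplification (it is not the full "all strings not in the specification" set) and uses hypothetical rule names. It also assumes that `Pattern.match` returns the longest match for these particular patterns, which holds here because none of them contains an alternation.

```python
import re

RULES = [
    ("Number", re.compile(r"[0-9]+")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Whitespace", re.compile(r"\s+")),
    ("ERROR", re.compile(r".", re.DOTALL)),   # catch-all, lowest priority
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name != "Whitespace":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end            # ERROR always matches, so pos advances
    return tokens

print(tokenize("x1 = 42 $"))
```

Because ERROR matches exactly one character, the scanner never gets stuck, and any real rule of equal or greater length takes precedence over it.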
Summary



Regular expressions provide a concise notation 
for string patterns 

Use in lexical analysis requires small extensions 
To resolve ambiguities 
To handle errors 

Good algorithms known 

Require only a single pass over the input
Few operations per character (table lookup) 
Finite Automata



Regular expressions = specification 

Finite automata = implementation 
mechanism for regular expressions 



A finite automaton consists of
An input alphabet Σ
A finite set of states S
A start state n

A set of accepting states F ⊆ S
A set of transitions state →(input) state
Finite Automata

Transition

s1 →a s2

Is read

In state s1, on input "a", go to state s2
If end of input and in accepting state => accept

Otherwise => reject
Finite Automata State Graphs



A state 




The start state 




An accepting state 




A transition 
A Simple Example

A finite automaton that accepts only "1"
Another Simple Example

A finite automaton accepting any number of 1's
followed by a single 0

Alphabet: {0,1} 
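
A minimal sketch of this automaton, assuming a dictionary-based encoding: the state names "A" and "B" and the function `accepts` are hypothetical, and a missing transition is treated as "reject".

```python
def accepts(transitions, start, accepting, text):
    """Run a DFA: follow one transition per input symbol, then check
    whether the final state is accepting."""
    state = start
    for symbol in text:
        if (state, symbol) not in transitions:
            return False              # no transition defined: reject
        state = transitions[(state, symbol)]
    return state in accepting

# "A" loops on 1 and moves to the accepting state "B" on 0.
TRANSITIONS = {("A", "1"): "A", ("A", "0"): "B"}

for s in ["0", "1110", "", "11", "100"]:
    print(repr(s), accepts(TRANSITIONS, "A", {"B"}, s))
```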
Another Example

Alphabet {0,1}

What language does this recognize? 
Epsilon Moves

Another kind of transition: ε-moves

Machine can move from state A to state B
without reading input

Think of ε-moves as a kind of free move.
Deterministic and Non-deterministic Automata

Deterministic Finite Automata (DFA)
One transition per input per state
No ε-moves

Nondeterministic Finite Automata (NFA)

Can have multiple transitions for one input
in a given state

Can have ε-moves
Execution of Finite Automata



A DFA can take only one path through the 
state graph 

The path is completely determined by 
the input (The machine has no choice) 



NFAs can choose

• Whether to make ε-moves

• Which of multiple transitions for a single
input to take
Acceptance of NFAs




• Input: 0 0



Rule: NFA accepts if it can get to a final state 
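
The acceptance rule can be sketched by tracking every state the NFA could be in after each symbol. The NFA below is a hypothetical example (strings over {0,1} ending in "00", with one ε-move from a start state "s"); it is not the automaton from the slide's figure, and `EPS`, `eps_closure`, and `nfa_accepts` are names introduced for this illustration.

```python
EPS = ""   # label used here for epsilon-moves (an assumption of this sketch)

def eps_closure(delta, states):
    """All states reachable from `states` using only epsilon-moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(delta, start, finals, text):
    """Accept if some sequence of choices ends in a final state."""
    current = eps_closure(delta, {start})
    for symbol in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, symbol), set())
        current = eps_closure(delta, moved)
    return bool(current & finals)

DELTA = {
    ("s", EPS): {"q0"},
    ("q0", "0"): {"q0", "q1"},   # two choices on input 0
    ("q0", "1"): {"q0"},
    ("q1", "0"): {"q2"},
}

for s in ["00", "100", "010", "0", ""]:
    print(repr(s), nfa_accepts(DELTA, "s", {"q2"}, s))
```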
NFA vs. DFA (1)

NFAs and DFAs recognize the same set of
languages (regular languages)

DFAs are faster to execute 

There are no choices to consider 
NFA vs. DFA (2)

For a given language, an NFA can be simpler than
a DFA



NFA 



DFA 




DFA can be exponentially larger than NFA 
Regular Expressions to Finite Automata

High-level sketch

Lexical Specification => Regular expressions => NFA => DFA =>
Table-driven Implementation of DFA
Regular Expressions to NFA (1)

For each kind of rexp, define an NFA
Notation: NFA for rexp M

[Figure: NFA for rexp M]

• For ε

• For input a
Regular Expressions to NFA (2)

• For AB

• For A + B
Regular Expressions to NFA (3)

• For A*
Example of RegExp -> NFA conversion

Consider the regular expression

(1+0)*1

The NFA is

[Figure: NFA for (1+0)*1]
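
The per-operator constructions can be sketched as a recursive builder that returns a start and an accept state for each sub-expression. This is a Thompson-style variant written for illustration; the tuple encoding of regular expressions, the `NFA` class, and the exact placement of ε-edges are assumptions and may differ from the figures on the slides. A small simulator is included only to check the result on (1+0)*1.

```python
from collections import defaultdict
from itertools import count

EPS = ""   # label used here for epsilon-moves

class NFA:
    def __init__(self):
        self.delta = defaultdict(set)   # (state, symbol) -> next states
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def edge(self, src, symbol, dst):
        self.delta[(src, symbol)].add(dst)

def build(nfa, rexp):
    """Add a fragment for rexp to nfa and return its (start, accept)."""
    kind = rexp[0]
    start, accept = nfa.new_state(), nfa.new_state()
    if kind == "eps":
        nfa.edge(start, EPS, accept)
    elif kind == "sym":                       # single input symbol
        nfa.edge(start, rexp[1], accept)
    elif kind == "cat":                       # AB
        a_start, a_accept = build(nfa, rexp[1])
        b_start, b_accept = build(nfa, rexp[2])
        nfa.edge(start, EPS, a_start)
        nfa.edge(a_accept, EPS, b_start)
        nfa.edge(b_accept, EPS, accept)
    elif kind == "alt":                       # A + B
        for sub in rexp[1:]:
            s, a = build(nfa, sub)
            nfa.edge(start, EPS, s)
            nfa.edge(a, EPS, accept)
    elif kind == "star":                      # A*
        s, a = build(nfa, rexp[1])
        nfa.edge(start, EPS, s)
        nfa.edge(start, EPS, accept)          # zero repetitions
        nfa.edge(a, EPS, s)                   # repeat A
        nfa.edge(a, EPS, accept)
    else:
        raise ValueError(f"unknown rexp kind: {kind}")
    return start, accept

def closure(nfa, states):
    stack, seen = list(states), set(states)
    while stack:
        for t in nfa.delta.get((stack.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(nfa, start, accept, text):
    current = closure(nfa, {start})
    for ch in text:
        step = set()
        for s in current:
            step |= nfa.delta.get((s, ch), set())
        current = closure(nfa, step)
    return accept in current

# (1+0)*1
REXP = ("cat", ("star", ("alt", ("sym", "1"), ("sym", "0"))), ("sym", "1"))
nfa = NFA()
START, ACCEPT = build(nfa, REXP)
for s in ["1", "0101", "10", ""]:
    print(repr(s), accepts(nfa, START, ACCEPT, s))
```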
Quiz

Choose the NFA that accepts the following
regular expression: 1* + 0

NFA to DFA
NFA to DFA: The Trick

Simulate the NFA

Each state of DFA

= a non-empty subset of states of the NFA
Start state

= the set of NFA states reachable through
ε-moves from NFA start state

Add a transition S →a S' to DFA iff

S' is the set of NFA states reachable from
any state in S after seeing the input a,
considering ε-moves as well
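
The construction described above can be sketched with sets of NFA states as DFA states. The example NFA is hypothetical (strings over {0,1} ending in "00", plus one ε-move from the start state), and all function and variable names are assumptions of this sketch. Only subsets that are actually reachable are generated, and the empty subset is skipped, matching the "non-empty subset" condition.

```python
EPS = ""   # label used here for epsilon-moves

def eps_closure(delta, states):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(delta, start, finals, alphabet):
    """Return (start set, DFA transitions, accepting sets)."""
    start_set = eps_closure(delta, {start})
    dfa, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            moved = set()
            for s in S:
                moved |= delta.get((s, a), set())
            T = eps_closure(delta, moved)
            if not T:
                continue                  # keep only non-empty subsets
            dfa[(S, a)] = T               # transition S -a-> T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    accepting = {S for S in seen if S & finals}
    return start_set, dfa, accepting

def dfa_accepts(start, dfa, accepting, text):
    state = start
    for a in text:
        if (state, a) not in dfa:
            return False
        state = dfa[(state, a)]
    return state in accepting

DELTA = {
    ("s", EPS): {"q0"},
    ("q0", "0"): {"q0", "q1"},
    ("q0", "1"): {"q0"},
    ("q1", "0"): {"q2"},
}
START, DFA, ACCEPTING = subset_construction(DELTA, "s", {"q2"}, "01")
for (S, a), T in DFA.items():
    print(sorted(S), a, "->", sorted(T))
```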
NFA to DFA: Remark

An NFA may be in many states at any time



How many different states ? 



If there are N states, the NFA must be in some 
subset of those N states 



How many subsets are there? 
2^N - 1 = finitely many
Implementation

A DFA can be implemented by a 2D table T
One dimension is "states"

Other dimension is "input symbol"

For every transition Si →a Sk define T[i,a] = k

If in state Si and input a, read T[i,a] = k and
skip to state Sk

Very efficient
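
A minimal sketch of the table-driven approach, assuming a DFA for "any number of 1's followed by a single 0" over the alphabet {0,1}. The state numbering, the explicit dead state 2, and the names `T`, `SYMBOL_INDEX`, and `run` are assumptions made for this illustration.

```python
# Column index for each input symbol.
SYMBOL_INDEX = {"0": 0, "1": 1}

# T[i][a] = k: in state i on symbol a, go to state k.
T = [
    [1, 0],   # state 0 (start): 0 -> state 1, 1 -> stay in state 0
    [2, 2],   # state 1 (accepting): any further input -> dead state
    [2, 2],   # state 2 (dead): stays dead
]
ACCEPTING = {1}

def run(text):
    state = 0
    for ch in text:
        state = T[state][SYMBOL_INDEX[ch]]   # one table lookup per character
    return state in ACCEPTING

for s in ["0", "110", "", "101", "00"]:
    print(repr(s), run(s))
```

Each input character costs one table lookup, which is where the efficiency comes from.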
Implementation (Cont.)



NFA -> DFA conversion is at the heart of tools 
such as flex 



But, DFAs can be huge 

In practice, flex-like tools trade off speed for 
space in the choice of NFA and DFA 
representations 

DFA: faster, less compact

NFA: slower, consumes less memory 